Method and system for determining copy number variation

ABSTRACT

Disclosed are a method and a system for determining genome copy number variation, which relates to the technical field of bioinformatics. The method comprises obtaining reads; determining sequence labels according to the reads; counting the number of sequence labels falling into each window; performing GC correction on the sequence label number of each window and a correction according to an expected sequence label number adjusted by a control set to obtain a corrected sequence label number; selecting a demarcation point with a small significance value as a candidate CNV breaking point; rejecting the least significant candidate CNV breaking point at every turn, updating difference significance values of two candidate CNV breaking points on the left and right of the rejected candidate CNV breaking point and performing cyclic iteration until difference significance values of all candidate CNV breaking points are smaller than a termination threshold value, thereby determining a CNV breaking point. The method and the system the present invention have clinical feasibility, and can precisely detect a micro-deletion/micro-duplication area of 0.5 M under the situation of using data of about 50 M.

TECHNICAL FIELD

Embodiments of the present invention generally relate to the field ofbioinformatics, more particularly to a method of detecting a copy numbervariation (CNV) and a system thereof.

BACKGROUND

CNV is one form of genomic structural variation. A narrow definition ofCNV often refers to a copy number change of a DNA fragment in achromosome. Types and reasons for this form of genomic structuralvariation may include: 1. deletion (terminal deletion, interstitialdeletion); 2. translocation (reciprocal translocation, Robertsoniantranslocation); 3. Inversion; 4. ring chromosome; 5. dicentricchromosome; 6. insert, etc. A broader definition of CNV also includes,for example, structural variations such as chromosome aneuploidy andpartial aneuploidy.

Presently available methods for detecting the copy number variationmainly include high resolution chromosome karyotype analysis, FISH(fluorescence in situ hybridization), Array CGH (array comparativegenomic hybridization), MLPA (Multiplex Ligation-Dependent ProbeAmplification), PCR (Polymerase Chain Reaction), etc, among which FISHdetection is regarded as a gold standard for genetic diagnosis, whichmay effectively be used to detect most of known chromosomal deletions orrepeats. However, these methods generally suffer from low efficiencies,particularly when used for whole-genome scans, which can consume largeamounts of resources or may not be able to detect unknown CNV.

Therefore, there is an urgent need for a new method of detecting a copynumber variation in order to characterize a known site and explore anunknown site in the genome.

SUMMARY

One technical problem to be solved by the present invention is toprovide a method of detecting a copy number variation and a systemthereof, which can accurately detect a copy number variation evenincluding micro-deletion and micro-duplication.

Embodiments of a first general aspect of the present invention relate toa method of detecting a copy number variation. According to embodimentsof the present invention, the method comprises the following steps:

obtaining reads from at least one part of a nucleic acid molecule of asample,

determining uniquely-mapped reads aligned to a (genomic) referencesequence based on the obtained reads,

dividing the genomic reference sequence into a plurality of windows, andcalculating the number of uniquely-mapped reads falling into each of theplurality of windows,

subjecting the number of uniquely-mapped reads falling into each of theplurality of windows to a GC correction, and to a correction based on anexpected number of uniquely-mapped reads adjusted by a control set toobtain a corrected number of uniquely-mapped reads,

calculating a significance value of the difference between two numericalpopulations each consisting of the corrected numbers of uniquely-mappedreads falling into windows on each of the two sides of a demarcationpoint, the demarcation point being a starting point or an ending pointof each of the plurality of windows, to thereby select the demarcationpoint having a smaller significance value as a candidate CNV breakpoint;

calculating a significance value of the difference between two numericalpopulations each consisting of the corrected number of uniquely-mappedreads falling into windows contained within each of two sequences, withone sequence ranging from a given CNV breakpoint to an adjacent upstreamCNV breakpoint, and the other sequence ranging from the given CNVbreakpoint to an adjacent downstream CNV breakpoint, and

removing the candidate CNV breakpoint having the least significance atevery turn and recalculating the significance value for the twocandidate CNV breakpoints adjacent to the removed candidate CNVbreakpoint, performing cyclic iteration until the significance values ofall candidate CNV breakpoints are less than a termination thresholdvalue, to thereby determine the CNV breakpoint.

Optionally, the method can further comprise a step of: subjecting the atleast one part of the nucleic acid molecule of the sample to sequencingto obtain the reads.

Optionally, each of the plurality of windows can contain the same numberof reference unique reads, or each of the plurality of windows can havethe same length.

Optionally, the termination threshold value can be obtained based on thecontrol set consisting of normal samples.

Optionally, the step of subjecting the number of uniquely-mapped readsfalling into each of the plurality of windows to the GC correction andto a correction based on the expected number of uniquely-mapped readsadjusted by a control set to obtain the corrected number ofuniquely-mapped reads can further comprise: grouping the plurality ofwindows based on a GC content, and obtaining a correction coefficientbased on a mean value of the number of uniquely-mapped reads within onegroup and a mean value of the number of uniquely-mapped reads for all ofthe plurality of windows, and then subjecting the number ofuniquely-mapped reads falling into each of the plurality of windows tothe correction, to obtain the GC-corrected number of uniquely-mappedreads.

Optionally, the expected number of uniquely-mapped reads adjusted by thecontrol set can be obtained by following steps:

calculating a ratio of the GC-corrected number of uniquely-mapped readsfalling into each of the plurality of windows in the control set to thetotal number of uniquely-mapped reads;

obtaining a mean value of the ratios for all windows corresponding tothe control set; and

calculating the expect number of uniquely-mapped reads for each of theplurality of windows in the sample based on the obtained mean value ofthe ratios and the total number of uniquely-mapped reads in the sample.

Optionally, after the CNV breakpoint is determined, the method canfurther comprise:

subjecting a sequence between two CNV breakpoints to a confidenceselection, wherein the confidence selection comprises steps of:

determining a confidence interval of the normal corrected number ofuniquely-mapped reads using the control set, based on a distribution ofthe corrected number of uniquely-mapped reads; and

determining an abnormality presenting in the sequence between the twoCNV breakpoints, if the mean value of the corrected number ofuniquely-mapped reads within the sequence is outside the confidenceinterval.

Optionally, the corrected number of uniquely-mapped reads can fit anormal distribution, and the confidence interval can be 95%.

Optionally, a cyclization with a single chromosome or a whole genome canbe performed when selecting the candidate CNV breakpoint.

Optionally, the method can further comprise:

obtaining the sample from a human, the sample comprising amniotic fluidobtained by amniocentesis, villus obtained by chorionic villi sampling,umbilical cord blood obtained by percutaneous umbilical blood sampling,spontaneous miscarrying fetus tissue or human peripheral blood; and/orobtaining a genomic DNA of the sample by a DNA extraction method such asa salting-out method, a column chromatography method, a beads method, ora SDS method; and/or subjecting the genome DNA of the sample to randomfragmenting by enzyme digestion, pulverization, ultrasound, orHydroShear method, to obtain DNA fragments; and/or subjecting the DNAfragments to single-end sequencing or pair-end sequencing, to obtain thereads of the DNA fragments.

Optionally, the method can further comprise adding different indexes toeach of the DNA fragments of the samples, to distinguish differentsamples.

Embodiments of a second general aspect of the present invention providea system for determining a copy number variation. According toembodiments of the present invention, the system can comprise:

a reads obtaining unit, for obtaining reads from at least one part of anucleic acid molecule of a sample;

a uniquely-mapped reads determining unit, for determininguniquely-mapped reads aligned to a genome reference sequence based onobtained reads;

a number of uniquely-mapped reads calculating unit, for dividing thegenome reference sequence into a plurality of windows, and calculatingthe number of uniquely-mapped reads falling into each of the pluralityof windows

a number of uniquely-mapped reads correcting unit, for subjecting thenumber of uniquely-mapped reads falling into each of the plurality ofwindows to a GC correction, and to a correction based on the expectednumber of uniquely-mapped reads adjusted by a control set to obtain thecorrected number of uniquely-mapped reads;

a candidate breakpoint selecting unit, for calculating a significancevalue of the difference between two numerical populations eachconsisting of the corrected numbers of uniquely-mapped reads fallinginto windows on each of the two sides of a demarcation point, thedemarcation point being a starting point or an ending point of each ofthe plurality of windows, to thereby select the demarcation point havinga smaller significance value as a candidate CNV breakpoint;

a breakpoint determining unit, for calculating a significance value ofthe difference between two numerical populations each consisting of thecorrected number of uniquely-mapped reads falling into windows containedwithin each of two sequences, with one sequence ranging from a given CNVbreakpoint to an adjacent upstream CNV breakpoint, and the othersequence ranging from the given CNV breakpoint to an adjacent downstreamCNV breakpoint, and for removing the candidate CNV breakpoint having theleast significance at every turn and recalculating the significancevalue for the two candidate CNV breakpoints adjacent to the removedcandidate CNV breakpoint, performing cyclic iteration until thesignificance values of all candidate CNV breakpoints are less than atermination threshold value, to thereby determine the CNV breakpoint.

Optionally, each of the plurality of windows can contain the same numberof reference unique reads, or each of the plurality of windows may havethe same length.

Optionally, the termination threshold value can be obtained based on thecontrol set consisting of normal samples.

Optionally, the number of uniquely-mapped reads correcting unitcomprises:

a GC correction module, for grouping the plurality of windows based on aGC content, and obtaining a correction coefficient based on a mean valueof the number of uniquely-mapped reads within one group and a mean valueof the number of uniquely-mapped reads for all of the plurality ofwindows, and then subjecting the number of uniquely-mapped reads fallinginto each of the plurality of windows to the correction, to obtain theGC-corrected number of uniquely-mapped reads;

a window correction module, for calculating a ratio of the GC-correctednumber of uniquely-mapped reads falling into each of the plurality ofwindows in the control set to the total number of uniquely-mapped reads;obtaining a mean value of the ratios for all windows corresponding tothe control set; and calculating the expect number of uniquely-mappedreads for each of the plurality of windows in the sample based on theobtained mean value of the ratios and the total number ofuniquely-mapped reads in the sample.

Optionally, after the CNV breakpoint is determined by the breakpointdetermining unit, the system can further comprise:

a breakpoint filtering unit, for determining a confidence interval ofthe normal corrected number of uniquely-mapped reads using the controlset, based on a distribution of the corrected number of uniquely-mappedreads; and determining an abnormality presenting in a sequence betweenthe two CNV breakpoints, if the mean value of the corrected number ofuniquely-mapped reads within the sequence is outside the confidenceinterval.

Optionally, the corrected number of uniquely-mapped reads can fit anormal distribution, and the confidence interval can be 95%.

Optionally, in the candidate breakpoint selecting unit, a cyclizationwith a single chromosome or a whole genome is performed when selectingthe candidate CNV breakpoint.

Optionally, the system can further comprise:

means for obtaining the sample from a human, with the sample comprisingamniotic fluid obtained by amniocentesis, villus obtained by chorionicvilli sampling, umbilical cord blood obtained by percutaneous umbilicalblood sampling, spontaneous miscarrying fetus tissue or human peripheralblood;

and/or

means for obtaining a genomic DNA of the sample by a DNA extractionmethod such as a salting-out method, a column chromatography method, abeads method, or a SDS method;

and/or

means for subjecting the genome DNA of the sample to random fragmentingby enzyme digestion, pulverization, ultrasound, or HydroShear method, toobtain DNA fragments;

and/or

means for subjecting the DNA fragments to single-end sequencing orpair-end sequencing, to obtain the reads of the DNA fragments.

Optionally, different samples are distinguished by adding differentindexes to each of the DNA fragments of the samples.

One of the advantages of the method of detecting a copy number variationand a system thereof according to embodiments of the present inventionlies in having clinical feasibility, which can accurately detect thecopy number variation including micro-deletion and micro-duplication.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chart showing a method of detecting a copy numbervariation according to an embodiment of the present invention;

FIG. 2 is a flow chart showing a method of detecting a copy numbervariation according to another embodiment of the present invention;

FIG. 3 is a flow chart showing a method of detecting a copy numbervariation according to a further embodiment of the present invention;

FIG. 4 is a brief flow chart showing a method of chromosomal CNVanalysis according to an embodiment of the present invention;

FIG. 5 is a schematic diagram showing a system of detecting a copynumber variation according to an embodiment of the present invention;

FIG. 6 is a schematic diagram showing a system of detecting a copynumber variation according to another embodiment of the presentinvention;

FIG. 7A-7H show detection results of eight samples respectively analyzedin an example of the present invention.

DETAILED DESCRIPTION

Terms used herein are explained as below:

Copy number variation (CNV): refers to a change of copy number of atleast a portion of a nucleic acid molecule in a sample to be tested ascompared with the corresponding nucleic acid sequence in a normalsample, wherein the portion has a length of more than 1 kb. Cases andreasons for the copy number variation can include: deletion, such asmicro-deletion; insertion, such as micro-insertion, micro-duplication,duplication; inversion, transposition and complex, multi-site variation.

Aneuploidy: refers to an addition or a reduction of the chromosomenumber presenting in a genetic material as compared with a normalsample, and can include an addition or reduction of an entire or partialchromosome. The copy number variation as used in the present disclosurecan also include aneuploidy.

Sequencing: refers to a process of obtaining information of a nucleicacid sequence of a sample. The sequencing can be performed by variousmethods, including but not limited to, dideoxy chain termination;preferably a high-throughput sequencing method, including but notlimited to, a Next-Generation sequencing technique or a single moleculesequencing technique.

The Next-Generation sequencing platform (Metzker M L. Sequencingtechnologies—the next generation. Nat Rev Genet. 2010 January;11(1):31-46) includes, but are not limited to, Illumina-Solexa (GATM,HiSeq2000™, etc), ABI-Solid and Roche-454 (pryosequencing) sequencingplatform; the single molecule sequencing platform (technique) includes,but are not limited to, True Single Molecule DNA sequencing of HelicosCompany, single molecule real-time (SMRT™) of Pacific BiosciencesCompany, and nanopore sequencing technique of Oxford NanoporeTechnologies Company (Rusk, Nicole (2009-04-01). Cheap Third-GenerationSequencing. Nature Methods 6 (4): 2446 (4)), etc.

A type of sequencing can be single-end sequencing and pair-endsequencing, a sequencing length can be 50 bp, 90 bp or 100 bp. In anembodiment of the present invention, the sequencing platform isIllumina/Solexa, the type of sequencing is pair-end sequencing, toobtain DNA sequence molecule having a length of 100 bp with arelationship of a bi-directional position.

In an embodiment of the present invention, a sequencing depth can bedetermined based on a length of variation fragment in a chromosome of asample to be test. The higher sequencing depth, the higher sensitivityof the detection, i.e., the smaller length of deletion fragment andduplication fragment which can be detected. The sequencing depth can be0.1-30×, i.e., the total amount is 0.1-30 folds in relative to a lengthof human genome, for example, in an embodiment of the present invention,the sequencing depth is 0.1×(2.5×10⁸ bp).

Reads: refers to a nucleic acid sequence having a certain length(generally longer than 20 bp), for example a sequencing result of thesequence generated by sequencer, which can be aligned to a specificregion or location of a reference sequence by a sequence alignmentmethod.

Sequence alignment (aligning): refers to a process of comparing one ormore nucleic acid sequences to a reference sequence. Specifically, anucleic acid sequence having a relative shorter length (such as a read)is aligned to a reference genome sequence, to determine a location ofthe nucleic acid sequence having the relative shorter length in thereference genome. When using a computer to perform sequence alignment,the sequence alignment can be performed by any one of sequence alignmentprocedures, such as ELAND (efficient local alignment of nucleotidedata), SOAP (Short Oligonucleotide Analysis Package) and BWA(Burrows-Wheeler Aligner), etc. A standard for recognizing a successfulalignment is classified into a non-fault-tolerant alignment (100% ofmatches) and partial-fault-tolerant alignment (less than 100% ofmatches).

Uniquely-mapped reads: refers to reads which can be aligned to a uniqueposition of a reference sequence (for example a reference genomesequence).

Reference unique reads: refers to sequences having a fixed length and auniquely-mapped position aligned to a reference sequence (generally areference genome). A process of obtaining the reference unique reads,for example includes, dividing a reference genome into a plurality ofsequences having a fixed length, aligning the plurality of sequences tothe reference genome, selecting a sequence uniquely-aligned to thereference genome as a reference uniquely-mapped sequence. The fixedlength is determined based on a sequence length of a sequencing resultobtained by a sequencer, which can specifically refer to a mean length.Different sequencer can obtain sequencing results of different sequencelength. Specific to every sequencing, the sequence length of thesequencing result can also be different, in which a certain subjectiveand experience factors can exist in the selection thereof.

Index: refers to a nucleic acid sequence having a specific length andplaying a function as a marker. When DNA molecules to be tested derivefrom a plurality of samples to be tested, each of the plurality ofsamples can be added with different indexes, for distinguishing theplurality of samples during sequencing (Micah Hamady, Jeffrey J Walker,J Kirk Harris et al. Error-correcting barcoded primers forpyrosequencing hundreds of samples in multiplex. Nature Methods, 2008,March, Vol. 5 No. 3), to achieve subjecting the plurality of samples tosimultaneous sequencing. The index is for distinguishing differentsequences, without affecting other functions of the DNA molecule addedwith the index.

GC correction: a certain GC bias exists among batches or within onebatch, which may cause copy number bias presenting in a region having ahigh GC content or a low GC content of a genome. The CG correction isperformed with sequencing data based on a control set, to obtaincorrected relative number of reads in each window, by which such biascan be eliminated, and an accuracy of detecting the copy numbervariation may be improved.

Mean value: generally refers to an arithmetic mean value or a median.

The number of uniquely-mapped reads: the number of uniquely-mapped readscan be statistic number obtained by calculating based on the initialnumber, or can be corrected value obtained by subjecting the number ofuniquely-mapped reads to a correction with a correction coefficient, forexample can be a ratio, in some cases can be interchangeable with “copyrate”.

Sample to be tested: can also be called as a test sample in some cases,which refers to a sample including a nucleic acid molecule which issuspected having variation. A type of the nucleic acid is not subjectedto special restriction, which can be desoxyribonucleic acid (DNA), orcan be ribonucleic acid (RNA), preferably is DNA. RNA can be convertedto DNA having a corresponding sequence by conventional means, forsubsequent detections and analysis.

Control sample: is relative to the sample to be tested, which isregarded as a normal sample. Generally, normal means having a normalphenotype.

Control sample set (Control set): refers to a set consisting of thecontrol samples, in an embodiment of the present invention, the numberof the control samples in the control set is required to be more than30.

Reference will be made in detail to embodiments of the presentinvention. The embodiments described herein with reference to drawingsare explanatory, illustrative, and used to generally understand thepresent invention.

With continuous development of high-throughput sequencing technology andgradually decrease of sequencing cost, the sequencing technology hasbeen increasingly and extensively applied in detection of chromosomevariation.

Sequencing technology to detect chromosomal aberrations in terms ofgaining are more widely used.

To improve the technology of detecting a copy number variation inclinic, the present invention designs a technical solution for screeningthe copy number variation in a whole genome level based onhigh-throughput sequencing technology, which has advantages of beinghigh throughput, high specific and high accuracy of alignment. Adetection result can be obtained by obtaining a sample from a subject;extracting DNA; high throughput sequencing, and subjecting obtained datato analysis.

FIG. 1 is a flow chart showing a method of detecting a copy numbervariation according to an embodiment of the present invention.

As shown in FIG. 1, in step 102, reads are obtained from at least onepart of a nucleic acid molecule of a sample. At least one part ofnucleic acid molecule or an entire nucleic acid molecule in a testsample can be subjected to sequencing to obtain reads. Reads of the partof nucleic acid molecule of the test sample can be obtained, or reads ofthe entire nucleic acid molecule can be obtained. For example, a genomicDNA molecule from a test sample is randomly fragmented to obtain DNAfragments, which are then subjected to sequencing to obtain reads havinga certain length. The length of obtained reads can be within a certainrange, and the reads having a fixed length can be obtained bytruncating. The DNA fragments can have a length of 50 bp˜1500 bp, suchas, 50 bp˜150 bp, 150 bp˜350 bp, 350 bp˜500 bp, 500 bp˜700 bp, 700bp˜1000 bp or 1000 bp˜1500 bp. For example, the DNA fragments can have alength of 50 bp, 90 bp, 100 bp, 150 bp, 300 bp, 350 bp, 500 bp, 700 bp,1000 bp, 1500 bp. In an example, 300 bp˜700 bp is preferred, 350 bp˜500bp is more preferred. The length of the reads can having a largedifference due to different sequencer, for example, general sequencelength of equipments such as illumina-solexa and life technologies-solidis within a range of 300 bp, while sequence length obtained byRoche-454, conventional Sanger sequencing, ultramodern single moleculesequencing system may be approximately or exceed 1000 bp. To meet anrequirement for unique alignment, when selecting uniquely-mapped reads,sequences having a length of 20 bp or more are generally selected,preferably, selected sequences have a length of 26 bp or more.

Step 104, uniquely-mapped reads aligned to a genome reference sequenceare determined based on the obtained reads. For example, entire orpartial sequences of the reads are aligned to the genome referencesequence, to obtain site information of the reads in the genome, andfurther obtain site information of the reads on a specific chromosome.For a sample deriving from a human subject, the human genome referencesequence can be a human genome reference sequence in NCBI database. Inan example of the present invention, the human genome reference sequenceis a human genome reference sequence of Build 36 in NCBI database (hg18;NCBI Build 36), alignment software used is SOAPaligner/soap2. Those DNAfragment reads uniquely-mapped to the genome reference sequence areselected, namely, those reads uniquely-mapped to the genome referencesequence only one time, i.e., uniquely-mapped reads aligned to the(genome) reference sequence.

Step 106, the genome reference sequence is divided into a plurality ofwindows, and the number of uniquely-mapped reads falling into each ofthe plurality of windows is calculated. A method for detecting theplurality of windows may comprise: fragmenting a reference genome intofragments having a sequencing length being same as that of the sample tobe tested, which are subjected to alignment by same alignment softwarewith same parameter, to screen out uniquely-mapped positions on achromosome; determining one of the plurality of windows by everyinterval having a certain length of the uniquely-mapped positions. Across-sliding between the pluralities of the windows can or cannot existdepending on selection. The number of uniquely-mapped sites which can beincluded in the plurality of windows relates to sequencing data volumeof the sample to be tested. Generally, the expected reads number of thesample to be tested falling into each of the plurality of windows is 300or more, so as to guarantee the reads number falling into the pluralityof windows fitting Poisson distribution. For example, assuming that thenumber of uniquely-mapped sites of a genome is N, effective reads numberof the sample to be tested is n, and the expected number of readsfalling into each of the plurality of windows in E, then each of theplurality of windows of the reference genome can include

$\frac{N}{n} \times E$

of the uniquely-mapped sites.

Step 108, the number of uniquely-mapped reads falling into each of theplurality of windows is subjected to a GC correction, and to acorrection based on the expected number of uniquely-mapped readsadjusted by a control set to obtain the corrected number ofuniquely-mapped reads. In an example, the step of subjecting the numberof uniquely-mapped reads falling into each of the plurality of windowsto the GC correction and to a correction based on the expected number ofuniquely-mapped reads adjusted by a control set to obtain the correctednumber of uniquely-mapped reads further comprises:

grouping the plurality of windows based on a GC content, and obtaining acorrection coefficient based on a mean value of the number ofuniquely-mapped reads within one group and a mean value of the number ofuniquely-mapped reads for all of the plurality of windows, and thensubjecting the number of uniquely-mapped reads falling into each of theplurality of windows to the correction, to obtain the GC-correctednumber of uniquely-mapped reads;

the expected number of uniquely-mapped reads adjusted by the control setis obtained by following steps:

calculating a ratio of the GC-corrected number of uniquely-mapped readsfalling into each of the plurality of windows in the control set to thetotal number of uniquely-mapped reads;

obtaining a mean value of the ratios for all windows corresponding tothe control set; and

calculating the expect number of uniquely-mapped reads for each of theplurality of windows in the sample based on obtained mean value of theratios and the total number of uniquely-mapped reads in the sample.

Step 110, a significance value of the difference between two numericalpopulations each consisting of the corrected numbers of uniquely-mappedreads falling into windows on each of the two sides of a demarcationpoint is calculated, with the demarcation point being a starting pointor an ending point of each of the plurality of windows, to therebyselect the demarcation point having a smaller significance value (i.e.,more significant difference) as a candidate CNV breakpoint. For example,a predetermined number of windows are selected as candidate CNVbreakpoints within a whole genome range based on a p value presenting asignificance level of copy number variation between two sides of each ofthe plurality of windows, to obtain a significance value for each of thecandidate CNV breakpoints, namely, a p value.

Step 112, a significance value of the difference between two numericalpopulations each consisting of the corrected number of uniquely-mappedreads falling into windows contained within each of two sequences iscalculated, with one sequence ranging from a given CNV breakpoint to anadjacent upstream CNV breakpoint, and the other sequence ranging fromthe given CNV breakpoint to an adjacent downstream CNV breakpoint, andthe candidate CNV breakpoint having the least significance is removed,and the significance value for the two candidate CNV breakpointsadjacent to the removed candidate CNV breakpoint is recalculated. Thisprocess is repeated until the significance values for all remainingcandidate CNV breakpoints are less than a termination threshold value,to thereby determine the CNV breakpoint. The termination threshold valuegenerally is preset. For example, the termination threshold value can beobtained by analyzing and processing a control set consisting of normalsamples.

In the above examples, the selection of a CNV breakpoint is performedby: aligning the obtained reads to the genome reference sequence,calculating the number of uniquely-mapped reads falling into each of theplurality of windows, subjecting the number of uniquely-mapped readsfalling into each of the plurality of windows to a GC correction and toa correction based on a control set, calculating and recalculating asignificance value of difference to repeatedly select candidate CNVbreakpoints, so as to achieve CNV detection. The method can accuratelydetermine relatively small copy number variation includingmicro-deletion/micro-duplication.

For a sample derived from a human, the subject sample can be a genomicDNA obtained from amniotic fluid by amniocentesis, villus obtained bychorionic villi sampling, umbilical cord blood obtained by percutaneousumbilical blood sampling, spontaneous miscarrying fetus tissue or humanperipheral blood. The genomic DNA can be extracted by conventionalextraction methods such as a salting-out method, a column chromatographymethod, a beads method, or a SDS method. In an example, the columnchromatography method is preferred, such column chromatography methodcan comprise: obtaining exposed DNA molecules by subjecting blood,tissue and cells to cell lysis buffer and proteinase K, and combiningthe DNA molecules to a silicone membrane under a condition of highsalinity, and then eluting the DNA molecules from the silicone membraneunder a condition of low salinity and high pH value. A specificprincipal and method can refer to specification of Tiangen TIANamp MicroDNA Kit (DP36).

If the DNA fragments to be detected are derived from a plurality ofsubject samples, the DNA fragments of each subject sample can be addedwith different indexes having a length of 4 bp to 12 bp, fordistinguishing different sample during sequencing (Micah Hamady, JeffreyJ Walker, J Kirk Harris et al. Error-correcting barcoded primers forpyrosequencing hundreds of samples in multiplex. Nature Methods, 2008,March, Vol. 5 No. 3). By such, a plurality of subject samples can besubjected to simultaneous detection, which can improve efficiency anddecrease detecting cost.

FIG. 2 is a flow chart showing a method of detecting a copy numbervariation according to another embodiment of the present invention.

Step 202, a genomic DNA molecule of a subject sample is randomlyfragmented to obtain DNA fragments, which can be performed by enzymedigestion, pulverization, ultrasound, or HydroShear method. Theultrasound method is preferably used, such as S-series from CovarisCompany (based on AFA technique, when sound energy/mechanical energyreleased from a sensor passing through a DNA sample, gas is dissolvedand forms a bubble. After the energy is removed, the bubble is rupturedto generate a capability for fragmenting the DAN fragments. By settingconditions such as a certain energy intensity and time interval, the DNAmolecule can be fragmented into fragments having a certain size. Aspecific principal and method can refer to specification for S-series ofCovaris Company), is used for fragmenting the DNA molecule intofragments having a certain concentrated size.

Step 204, the DNA fragments are subjected to sequencing, to obtainsequencing sequences of the DNA fragments, namely, reads. The readsobtained from sequencing can have a certain length within a range. Thereads having a fixed length can be obtained by subjecting reads of theDNA fragments to truncation. The reads of the DNA fragments used in thepresent example hereafter refer to reads having a fixed length. Methodsused for sequencing may be high-throughput sequencing methods such asIllumina/Hiseq2000, ABI/SOLiD, Roche/454. Sequencing type may beSingle-end sequencing or Pair-end sequencing, and sequencing length maybe 50 bp to 1500 bp. In an example of the present invention, the usedsequencing platform is Illumina/Hiseq2000, the sequencing type isPair-end sequencing, and obtained DNA sequence molecules with a lengthof 100 bp has a relationship of a bi-directional position. Thesequencing depth can be determined based on the length of variationfragment in a chromosome. The higher sequencing depth, the highersensitivity of the detection, i.e., the smaller length of deletionfragment and duplication fragment which can be detected. In an exampleof the present invention, the sequencing volume of the human subjectsample ranges of 2˜900×10⁸ of reads.

Step 206, the reads are aligned to a genome reference sequence, toobtain site information of the reads in a genome.

Step 206, uniquely-mapped reads aligned to a genome reference sequenceis selected.

Step 208, the number of uniquely-mapped reads falling into each of theplurality of windows is determined. For each subject sample, the numberof uniquely-mapped reads falling into each of the plurality of windowsis calculated (recorded as n_(i,j), subscripts I and j respectivelyrepresent No. of the plurality of window and No. of the subject samples,for distinguishing which are omitted for brevity).

Step 210, an average GC content for each of the plurality of window inthe genome is determined to determine a correction coefficient for eachof the plurality of windows; and the corrected number of uniquely-mappedreads for each of the plurality of windows is obtained based on thecorrection coefficient. Such step is mainly to subject the number ofuniquely-mapped reads falling into each of the plurality of windows tothe correction based on the GC content for each of the plurality ofwindows, which can be called as batches correction or GC correction.

The average GC content (recorded as GC_(i,j)) of the uniquely-mappedreads falling into each of the plurality of windows is calculated. Thecalculation of the average GC content (GC_(i,j)) includes calculatingaverage GC content of all uniquely-mapped reads falling into each of theplurality of windows. During calculation, the total number of bases Gand C of all uniquely-mapped reads is recorded as N_(gc), a total lengthof all uniquely-mapped reads is recorded as L, then

${GC}_{i,j} = {\frac{N_{gc}}{L}.}$

To correct a difference of data volume of the subject sample, each ofthe plurality of windows are grouped based on GC_(i,j), i.e., windowshaving same GC_(i,j) are grouped into one group, a median or anarithmetic mean value m_(g,j) for the number of uniquely-mapped reads ineach group is determined, which is divided by a median or an arithmeticmean value m_(j) for the number of uniquely-mapped reads in the wholegenome level, to obtain the correction coefficient c_(g,j)(c_(g,j)=m_(g,j)/m_(j)), in which the subscript g represents GC contentsof different groups. The original number of uniquely-mapped readsfalling into each of the plurality of windows n_(i,j) is multiplied bythe correction coefficient c_(g,j) to obtain a corrected value of thenumber of uniquely-mapped reads falling into each of the plurality ofwindows (recorded as n_(i,j)).

Step 212, the corrected number of uniquely-mapped reads falling intoeach of the plurality of windows is divided by the expected number forthe corresponding each of the plurality of windows, to obtain correctednumber of uniquely-mapped reads for each of the plurality of windows,i.e., copy rate. The expected number for the corresponding each of theplurality of windows is obtained by a control set consisting of normalsamples. Such steps is mainly to subject the number of uniquely-mappedreads for each of the plurality of windows to correction based on dataof normal samples, which can be called as window correction.

In the sample of control set, a percentage of relative number ofuniquely-mapped reads (P_(i,j)) is defined as a ratio of the number ofuniquely-mapped reads within a window (n_(i,j)) to the total number ofuniquely-mapped reads for the whole genome (N_(j)), i.e.,

${P_{i,j} = \frac{n_{i,j}^{\prime}}{N_{j}}};$

and then an average percentage of each of the plurality of windows inthe control set P_(i) is calculated, i.e.,

$\overset{\_}{P_{i}} = {\frac{1}{n}{\sum\limits_{j = 1}^{n}\; {P_{i,j}.}}}$

In the subject sample, the copy rate of each of the plurality of windowsr_(i,j) is obtained by dividing the corrected number of uniquely-mappedreads n_(i,j) by the expected number within the window (multiplying thetotal number of uniquely-mapped reads for the whole genome with thepercentage of the number of uniquely-mapped reads falling into each ofthe plurality of windows), i.e.,

$r_{i,j} = {\frac{n_{i,j}}{\overset{\_}{P_{i}} \times N_{j}}.}$

In selecting the control set, the method of library construction,sequencing reagent and sequencing type should be consistent with thesample to be tested as much as possible, so as to improve effect ofcorrecting the sample to be tested by the control sample. The sample inthe control set should be a normal sample with a sample volume of 30 ormore.

Step 214, a predetermined number of windows are selected as candidateCNV breakpoints within a whole genome range based on a p valuepresenting a significance value of copy number variation between the twosides of each of the plurality of windows, and the p value for each ofthe candidate CNV breakpoints is obtained.

1) Selection of the candidate CNV breakpoints: for all windows of thewhole genome, calculating a difference of copy number variations betweenthe two sides of each of the plurality of windows, each consisting of acertain number of windows (the number of windows is normally more than30 or satisfies a restriction of the lowest sample amount for adetection model, in order to make the detection model to detectsignificant difference); and selecting a certain number (for example, 1%of the total number of windows) of sites (corresponding to the pluralityof windows) as the candidate CNV breakpoints (namely, a demarcationpoint of each CNV fragments), based on the significant difference levelin the whole genome range (from small value to large value of the pvalue).

2) Initialization: a set of all ranked breakpoints is recorded asB_(c)={b₁, b₂, . . . , b_(k) . . . , b_(s)}, then two adjacentbreakpoints k+1 and k−1 exist at two sides of the breakpoint k. Bycalculating a significance value for the difference between a set ofcopy numbers falling into the windows in a sequence from k−1 to k and aset of copy numbers falling into windows in a sequence from k to k+1, ap value representing a significance difference between the two sides ofbreakpoint k is obtained. For example, in an example, a Run test innon-parameter detection is selected, which evaluates the significantdifference of two populations by homogeneous status of distribution withmixed elements of the two populations, specifically referring to Wald,A. & Wolfowitz, J. On a test whether two samples are from the samepopulation. The Annals of Mathematical Statistics 11, 147-162 (1940).

Step 216, the candidate CNV breakpoint having the least significance isremoved, and the significance values for the two candidate CNVbreakpoints adjacent to the removed candidate CNV breakpoint arerecalculated. This process is repeated until the significance values forall candidate CNV breakpoints are less than a final p value (i.e., atermination threshold value), which is obtained based on the controlset.

Iteration merger: the candidate breakpoints having the leastsignificance is removed from each round of cycling and iterating, andthe p values for the two adjacent breakpoints are recalculated until allof the p values are less than the final p value.

The final p value is obtained by: for example, subjecting the controlsamples to the above operation of Iteration merger; recording thelargest p value for each Iteration merger; performing the aboveoperation of Iteration merger until merging into one fragment. By then,according to a change trend of the largest p value, the largest p valuefor one time of Iteration merger corresponding to one site having themost dramatic change of the p value (i.e., selecting one site having themost obvious change of slope in a curve of the p value change (a sitehaving the largest curvature)) or that of the previous time of Iterationmerger is taken as the termination threshold value.

The above steps 214 and 216 can also be called as fragmentation. Notice:when selecting the windows and the breakpoints in 1) and 2) of the step214, a cyclization with a single chromosome or a whole genome can beconsidered to be performed. The cyclization with the single chromosomerefers to: when calculating the plurality of windows near the startingpoint of a chromosome, if the effective window number on the left sideis insufficient for statistical detection, then sufficient number ofwindows for calculation is obtained reversely from the end point of suchchromosome; in a similar way, for those positions, of which thesufficient effective number of windows near the right side of the endpoint is not able to be obtained, are obtained from the front point ofthe chromosome. Such operation makes those windows located at the frontpoint and the end point still can be calculated. The cyclization withthe whole genome is: indexing the end point of the former chromosomewhen the effective number of windows located at the front point of eachchromosome, while indexing the front point of the latter chromosome whenthe effective number of windows located at the end point isinsufficient, while chromosome 1 connects to chromosome Y.

After the step 216, the method further comprises subjecting the sequencebetween two CNV breakpoints to a confidence selection, in which theconfidence selection comprises steps of: determining a confidenceinterval of normal copy rate using the control set, based on adistribution of the copy rate; and determining an abnormality presentingin the sequence between the two CNV breakpoints, if the mean value ofthe copy rate within the sequence is outside the confidence interval. Inan example, the copy rate the copy rate fits a normal distribution, andthe confidence interval is 95%. The fragmenting result is subjected to afiltration by such step, to obtain a reliable result. If the mean valueof R_(i,j)′ is smaller than a lower cut-off value or larger than anupper cut-off value, then a corresponding result is output as a positiveresult.

Selection of termination threshold value can comprise: calculating adistribution of copy rate for a window in each of the control samples,which is based on central-limit theorem, in which reads in the windoware random, then the copy rate r fits a normal distribution, andquantiles of the left and the right with a significance level being as0.05 are selected. A mean value thereof is calculated in the controlset, which are taken as the upper and lower threshold value forscreening copy number variation.

In the above examples, accuracy of detection result is improved by thebatches correction and the window correction. By introducing the controlset, accuracy can also be improved by enlarging the control set, whichcan reduce requirement for initial DNA amount.

FIG. 3 is a flow chart showing a method of detecting a copy numbervariation according to a further embodiment of the present invention.FIG. 3 includes a flow chart showing a method of handling a control setconsisting of normal samples (3A) and a flow chart showing a method ofhandling a subject sample (3B). The control set is mainly used forobtaining data for correcting the subject sample, and for thetermination threshold value being as a terminating condition ofiterating the subject sample.

As shown in FIG. 3, the flow chart 3A comprises:

Step 301A, extracting DNA molecule from the control sample;

Step 311A, randomly fragmenting the DNA molecule of the control sampleinto DNA fragments, which are then subjected to sequencing, to obtainsequencing sequence data of the DNA molecule of the control sample,i.e., reads;

Step 312A, aligning the reads of the control sample to a referencegenome sequence;

Step 313A, calculating the number of uniquely-mapped reads falling intoeach of the plurality of windows, i.e., the number of uniquely-mappedreads;

Step 314A, subjecting the control sample to the batches correction;

Step 315A, obtaining the expected number of the plurality of windows bythe control sample, for subjecting the control sample to the windowcorrection.

Step 316A, selecting breakpoints and fragmenting. The step of selectinga candidate CNV breakpoint comprises: removing the candidate CNVbreakpoint having the least significance and recalculating the p valueof the two candidate CNV breakpoints adjacent to the removed candidateCNV breakpoint, cycling and iterating until the number of rest fragmentsequals to a predetermined value (such as 24).

Step 317A, determining a termination threshold value. By calculating amean value of current final p value, the final p value can beeffectively obtained, as the termination threshold value of thecondition for terminating iteration with the subject sample.

Step 3B comprises:

Step 310B, extracting DNA molecule of a subject sample;

Step 311B, randomly fragmenting the DNA molecule of the control sampleinto DNA fragments, which are then subjected to sequencing, to obtainreads of the DNA molecule of the control sample;

Step 312B, aligning the reads of the DNA molecule of the control sampleto a reference genome sequence;

Step 313B, calculating the number of uniquely-mapped reads falling intoeach of the plurality of windows, i.e., the number of uniquely-mappedreads.

Step 314B, subjecting the subject sample to batches correction.

Step 315B, subjecting the subject sample to window correction based onthe expected number of uniquely-mapped reads falling into each of theplurality of windows for the control sample;

Step 316B, selecting breakpoints and fragmenting;

Step 317B, subjecting obtained result to filtering.

For selecting the control set, a method of library constructing,sequencing reagent and sequencing type should be consistent with thesample to be tested as much as possible, so as to improve effect ofcorrecting the sample to be tested by the control sample. The sample inthe control set should be a normal sample with a sample volume being 30or more.

FIG. 4 is a brief flow chart showing a method of chromosomal CNVanalysis according to an embodiment of the present invention.

As shown in FIG. 4, step 401 is DNA extraction and sequencing: aftergenomic DNA is extracted in accordance with specification of TiangenDP327-02 Kit, a library is constructed in accordance to a process oflibrary construction based on Illumina/Hiseq2000 standard. During suchprocess, those DNA molecules concentrating in 500 bp are ligated, withadaptors at both ends thereof for sequencing, in which each sample isalso added with different indexes respectively, by which data from theplurality of samples can be distinguished in a result of sequencing forone time.

Step 402, sequence alignment: A sequencing method of Illumina/Hiseq2000is used for sequencing (other sequencing methods such as ABI/SOLiD canalso achieve a same or similar effect), each sample can obtain reads ofDNA fragments having a certain length, which is aligned to a standardhuman genome reference sequence in NCBI database using SOAP2, to obtaininformation of a corresponding position where the reads locate in thegenome. To avoid interference of repetitive sequence to CNV analysis,only those reads uniquely mapped to the human genome reference sequenceare selected, i.e., those reads can be mapped to the human genomereference sequence only once, also called as the number ofuniquely-mapped reads, as the effective data for subsequent CNVanalysis.

Step 403, PSCC analysis. A serious of bioinformatics methods ofdetecting a copy number variation in a whole genome copy which areindependently developed by the inventor of the present invention areutilized, which comprises: subjecting the subject sample to batchescorrection, subjecting the subject sample to window correction based ona control set, normalization and fragmentation (segmentation).

Step 404, the fragments having the determined copy number in the step403 are subjected to CNV analysis, a copy rate≦0.7 and a copy rate≧1.3of the subject sample are taken as a cut-off value for detectingdeletion and duplication of the fragments, by which the fragments havingthe copy number variation in a whole genome level is obtained byanalysis, and then obtained result is subjected to visualization.

In the above examples, used software algorithm is a series procedure ofdetecting a copy number variation for a whole genome developed byShenzhen BGI, which are called as PSCC. It can generate data by theNext-Generation sequencing technology, subjecting the subject sample tobatches correction, and performing data correction with a control set,normalization and segmentation, to estimate degree and size of copynumber variation of the subject sample. Under a condition of a lowersequencing depth (50 M of sequencing short-sequences), about 0.5 Mb offragments having a single copy number variation (CNV) can be detected.

FIG. 5 is a schematic diagram showing a system of detecting a copynumber variation according to an embodiment of the present invention. Asshown in FIG. 5, the system can comprise: a reads obtaining unit 51, forobtaining reads from at least one part of a nucleic acid molecule of asample; a uniquely-mapped reads determining unit 52, for determininguniquely-mapped reads aligned to a genome reference sequence based onobtained reads; the number of uniquely-mapped reads calculating unit 53,for dividing the genome reference sequence into a plurality of windows,and calculating the number of uniquely-mapped reads falling into each ofthe plurality of windows; the number of uniquely-mapped reads correctingunit 54, for subjecting the number of uniquely-mapped reads falling intoeach of the plurality of windows to a GC correction, and to a correctionbased on the expected number of uniquely-mapped reads adjusted by acontrol set to obtain the corrected number of uniquely-mapped reads; acandidate breakpoint selecting unit 55, for calculating a significancevalue of the difference between two numerical populations eachconsisting of the corrected numbers of uniquely-mapped reads fallinginto windows on each of the two sides of a demarcation point, with thedemarcation point being a starting point or an ending point of each ofthe plurality of windows, to thereby select the demarcation point havinga smaller value of the significance as a candidate CNV breakpoint; abreakpoint determining unit 56, for calculating a significance value ofthe difference between two numerical populations each consisting of thecorrected number of uniquely-mapped reads falling into windows containedwithin each of two sequences, with one sequence ranging from a given CNVbreakpoint to an adjacent upstream CNV breakpoint, and the othersequence ranging from the given CNV breakpoint to an adjacent downstreamCNV breakpoint, and removing the candidate CNV breakpoint having theleast significance and recalculating the significance values for the twocandidate CNV breakpoints adjacent to the removed candidate CNVbreakpoint, cycling and iterating until the significances of allcandidate CNV breakpoint are less than a termination threshold value, tothereby determine the CNV breakpoint; the termination threshold value isobtained based on the control set consisting of normal samples. When thenumber of uniquely-mapped reads calculating unit 53 divides theplurality of windows, each of the plurality of windows can contain thesame number of reference unique reads, or each of the plurality ofwindows can have a same length. In an example, in the candidatebreakpoint selecting unit 55, a cyclization with a single chromosome ora whole genome is performed when selecting the candidate CNV breakpoint.

In the above examples, the uniquely-mapped reads determining unitdetermines uniquely-mapped reads which can be uniquely aligned to the(genomic) reference sequence based on obtained reads, the number ofuniquely-mapped reads correcting unit subjects the number ofuniquely-mapped reads falling into each of the plurality of windows to acorrection, and the candidate breakpoint selecting unit and thebreakpoint determining unit perform cycling and iterating of genesignificant difference for selecting the CNV breakpoints, by which CNVdetection can be completed, which can accurately detect a regioncomprising the copy number variation including smallermicro-deletion/micro-duplication.

FIG. 6 is a schematic diagram showing a system of detecting a copynumber variation according to anther embodiment of the presentinvention. As shown in FIG. 6, the system comprises: a reads obtainingunit 51, a uniquely-mapped reads determining unit 52, the number ofuniquely-mapped reads calculating unit 53, the number of uniquely-mappedreads correcting unit 64, a candidate breakpoint selecting unit 55 and abreakpoint determining unit 56. The reads obtaining unit 51, theuniquely-mapped reads determining unit 52, the number of uniquely-mappedreads calculating unit 53, the candidate breakpoint selecting unit 55and the breakpoint determining unit 56 can refer to specific descriptionin FIG. 5, which is omitted for brevity. The number of uniquely-mappedreads correcting unit 64 further includes a GC correction module 641 anda window correction module 642. In which, the GC correction module isused for grouping the plurality of windows based on a GC content, andobtaining a correction coefficient based on a mean value of the numberof uniquely-mapped reads within one group and a mean value of the numberof uniquely-mapped reads for all of the plurality of windows, and thensubjecting the number of uniquely-mapped reads falling into each of theplurality of windows to the correction, to obtain the GC-correctednumber of uniquely-mapped reads; the window correction module 642 isused for calculating a ratio of the GC-corrected number ofuniquely-mapped reads falling into each of the plurality of windows inthe control set to the total number of uniquely-mapped reads; obtaininga mean value of the ratios for all windows corresponding to the controlset; and calculating the expect number of uniquely-mapped reads for eachof the plurality of windows in the sample based on obtained mean valueof the ratios and the total number of uniquely-mapped reads in thesample, also called as copy rate.

In an example of the present invention, the system further comprises: abreakpoint filtering unit 67, for determining a confidence interval ofthe normal corrected number of uniquely-mapped reads using the controlset, based on a distribution of the corrected number of uniquely-mappedreads; and determining an abnormality presenting in the sequence betweenthe two CNV breakpoints, if the mean value of the corrected number ofuniquely-mapped reads within the sequence is outside the confidenceinterval. In an example, the corrected number of uniquely-mapped readsfits a normal distribution, and the confidence interval is 95%.

In an example, the system further comprises: means for obtaining thesample from a human, with the sample comprising amniotic fluid obtainedby amniocentesis, villus obtained by chorionic villi sampling, umbilicalcord blood obtained by percutaneous umbilical blood sampling,spontaneous miscarrying fetus tissue or human peripheral blood; and/ormeans for obtaining a genomic DNA of the sample by a DNA extractionmethod such as a salting-out method, a column chromatography method, abeads method, or a SDS method; and/or means for subjecting the genomeDNA of the sample to random fragmenting by enzyme digestion,pulverization, ultrasound, or HydroShear method, to obtain DNAfragments; and/or means for subjecting the DNA fragments to single-endsequencing or pair-end sequencing, to obtain the reads of the DNAfragments. In an example, different samples are distinguished by addingdifferent indexes to each of the DNA fragments of the samples.

Functions of each unit in FIG. 5 and FIG. 6, can refer to specificationof the corresponding part above according to the method in examples ofthe present invention, which are omitted for brevity.

Those skilled in the art will recognize hardware, firmware and softwareconfigurations in these cases are interchangeable, to achieve thefunction of each specific application as much as possible.

Reference will be made in detail to examples of the present invention.It would be appreciated by those skilled in the art that the followingexamples are explanatory, and cannot be construed to limit the scope ofthe present invention. If the specific technology or conditions are notspecified in the examples, a step will be performed in accordance withthe techniques or conditions described in the literature in the art orin accordance with the product instructions. If the manufacturers ofreagents or instruments are not specified, the reagents or instrumentscan be commercially available. Descriptions in following bracketsrespectively illustrate catalog No. of different manufacturers forvarious reagents or kits. The adaptor and index used for sequencingderive from Multiplexing Sample Preparation Oligonutide Kit of IlluminaCompany.

Application Example 1 Detection with 2 Samples Having Chromosome NumberVariation and 6 Samples Having Micro-Deletion

1. DNA Extraction

8 samples (hereinafter referred as Sample1, Sample2, Sample3, . . . ,Sample8) were subjected to DNA extraction in accordance withspecification of TIANamp Micro DNA Kit (DP316) of Tiangen. The extractedDNA was subjected to library construction in accordance with revisedspecification of Illumina/Hiseq2000. Those DNA molecules concentratingin a length of 500 bp were ligated with an adaptor at both ends thereoffor sequencing. Each sample was added with different indexesrespectively, which were then subjected to hybridization with acomplementary adaptor on Flowcell surface, by which nucleic acidmolecules clustering grew under a certain condition, which were thensubjected to Pair-end sequencing on Illumina Hiseq2000, to obtainsequences of DNA fragments having a length of 100 bp.

In details, about 100 ng of DNA obtained from the above amniotic fluidsample (Quant-IT dsDNA HS Assay kit), were subjected to libraryconstruction in accordance with revised specification ofIllumina/Hiseq2000, of which detailed procedure could refer to prior art(specification of standard library construction of Illumina/Solexaprovided in http://www.illumina.com/). The obtained DNA library andinsert fragments were determined having a length of 500 bp using2100Bioanalyzer (Agilent), which were subjected to sequencing oncomputer after accurate quantification by QPCR.

2. Sequencing:

in the current example, above DNA obtained from the 8 samples weresubjected to sequencing on computer in accordance with specifications ofClusterStation and Hiseq2000 (PEsequencing) officially published byIllumina/Solexa, to obtain about 5G of data volume for each sample,which were distinguished by ligated indexes. Using alignment softwareSOAP2, DNA sequence obtained by sequencing was aligned to a human genomereference sequence Build 36 in NCBI database (hg18; NCBIBuild36), toobtain information of a corresponding position where the reads locate inthe genome.

3. Data Analysis

a) basic statistics: calculating the number of uniquely-mapped readsfalling into each of the plurality of windows (i.e., the number ofuniquely-mapped reads, recorded as n_(i,j) subscripts I and jrespectively represented No. of the plurality of window and No. of thesubject samples, for distinguishing, which are omitted for brevity), andan average GC content (recorded as GC_(i,j)).

b) batches correction: to correct a difference of sample data volume,the plurality of windows were grouped based on the average GC content ofreads falling into thereof. A median or an arithmetic average for eachgroup was divided by a median or an arithmetic average for the wholegenome, to obtain a correction coefficient c_(g), in which the subscriptg represented GC contents of different groups. The original number ofuniquely-mapped reads falling into each of the plurality of windowsn_(i,j) was multiplied by the correction coefficient c_(g) to obtain acorrected value of the number of uniquely-mapped reads falling into eachof the plurality of windows (recorded as n_(i,j)′)

c) data correction: sequencing data obtained from 90 samples in YHpopulation were selected as the control set. In the sample of controlset, a percentage of relative number of uniquely-mapped reads (P_(i,j))was defined as a ratio of the number of uniquely-mapped reads within awindow (n_(i,j)′) to the total number of uniquely-mapped reads for thewhole genome (N_(j)), i.e.,

${P_{i,j} = \frac{n_{i,j}^{\prime}}{N_{j}}};$

and then an average percentage of each of the plurality of windows inthe control set P_(i) was calculated, i.e.,

$\overset{\_}{P_{i}} = {\frac{1}{n}{\sum\limits_{j = 1}^{n}\; {P_{i,j}.}}}$

In the subject sample, the copy rate of each of the plurality of windowsr_(i,j) was obtained by dividing the corrected number of uniquely-mappedreads n_(i,j)′ by the expected number within the window (multiplying thetotal number of uniquely-mapped reads for the whole genome with thepercentage of the number of uniquely-mapped reads falling into each ofthe plurality of windows), i.e.,

$r_{i,j} = {\frac{n_{i,j}}{\overset{\_}{P_{i}} \times N_{j}}.}$

d) fragmentation (segmentation)

{circle around (1)} selection of a candidate breakpoint: for all r_(i,j)in the whole genome, calculating a difference in copy number changesbetween two sides of a point, each side having 100 windows; andselecting 10000 points with the most significant differences as thecandidate CNV breakpoints, based on the significant difference level inthe whole genome range (from small value to large value of the p value).

{circle around (2)} initialization: a set of all ranked breakpoints wasrecorded as B_(c)={b₁, b₂, . . . , b_(k) . . . , b_(s)}, then twoadjacent breakpoints k+1 and k−1 existed at two sides of breakpoint k.By calculating a significance value of the difference between a set ofcopy numbers falling into windows from k−1 to k and a set of copynumbers falling into windows from k to k+1, a p value representing asignificant difference between the two sides of breakpoint k wasobtained (a Run test of non-parameter detection was used in the currentexample).

{circle around (3)} Iteration merger included: by continuous cycling anditerating, the candidate breakpoints having the least significance isremoved from each round, and the p values for the two adjacentbreakpoints are recalculated until all of the p values are less than1E⁻⁵⁰.

Notice: when selecting the windows and the breakpoints in steps {circlearound (1)} and {circle around (2)} a cyclization with the whole genomewas performed, to ensure that those windows located at the front pointand the end point of a chromosome still can be calculated.

e) termination threshold value and filtration: a result obtained afterfragmentation was subjected to filtering. If the mean value of r_(i,j)for a fragment was smaller than 0.7 or larger than 1.3, a correspondingresult was output as a positive result.

f) visualization of the positive result

4. Statistical Calculation

For the 8 samples, the detection result and verification result wereshown in details in Table below.

In particular, the verification result was obtained by CGH chip(comparative genomic hybridization chip). Human Genome CGH MicroarrayKit (Agilent Technologies Inc.) was used in the experiments inaccordance with specification provided by manufacturer.

TABLE 1 CNV results of 8 samples Sample No. Detection result lengthVerification result Determination Sample 1 T7 159.14M T7 consistentSample 2 XYY 59.37M XYY consistent Sample 3 chr5: 1 . . . 3687790336.88M 5p15.3~p13.2 consistent deletion (183931~36816731) × 1; chr18:39024931 . . . 76038278 37.01M 18p12.3~q23 addition (39086755~76067279)× 3 Sample 4 chr5: 1 . . . 17710089 17.71M 5p15.33~p15.1 × 1 consistentdeletion Sample 5 chr15: 21236149 . . . 26219186 4.98M 15q11.2~q13.1 × 1consistent deletion Sample 6 chr1: 1 . . . 5065299 5.07M 1p36.33~p36.32× 3 consistent addition Sample 7 chr5: 105562861 . . . 106156933 0.59M5q21.3 × 1 consistent deletion Sample 8 chr9: 6851755 . . . 72484160.40M 9p24.1 × 1 consistent deletionIn Table 1, chr represented chromosome, T7 represented trisomy ofchromosome 7, and XYY represented trisomy variation of sex chromosome.

FIG. 7A-7H show detection results of eight samples respectively in anexample of the present invention.

It could be seen from the above Table 1 and FIG. 7A-7H that, as small as0.4M of fragments having micro-deletion and as large as entirechromosome number variation could be accurately detected and located bythe method of the present invention, which demonstrated the excellentefficiency and accuracy of the present invention.

Comparing with those analysis methods for detecting the copy numbervariation that have been reported thus far, advantages of the presentinvention mainly include:

1) resolution: by using about 50 M of data, a region of 0.5 M havingmicro-deletion can be accurately detected.

2) expandability: in addition to increasing sequencing volume, theaccuracy is also improved by enlarging the control set, which can reducerequirement for initial DNA amount.

3) more stable and more comprehensive: there is few detailed descriptionin reported articles, while the present disclosure describes variousaspects such as batch correction and population correction with data, aswell as optimization with condition for fragmentation, etc.

The method of the present invention can be used for whole genome copynumber variation detection in the target patients, which can facilitategenetic counseling, provide basis for clinical decision, and make anaccurate pathology determination for patients suffering frommicro-deletion syndrome. The target patients of the present inventioncan be patients suffering from micro-deletion or potential carriers. Theexamples of target patients are used for illustrating the presentinvention, and should not be used to limit the scope of the presentinvention.

Although illustrative embodiments have been shown and described, itwould be appreciated by those skilled in the art that the aboveembodiments cannot be construed to limit the present invention, andchanges, alternatives, and modifications can be made in the embodimentswithout departing from the spirit, principles and scope of the presentinvention.

1. A method of detecting a copy number variation comprising followingsteps: obtaining reads from at least one part of a nucleic acid moleculeof a sample, determining uniquely-mapped reads aligned to a (genomic)reference sequence based on the obtained reads, dividing the genomicreference sequence into a plurality of windows, and calculating thenumber of uniquely-mapped reads falling into each of the plurality ofwindows, subjecting the number of uniquely-mapped reads falling intoeach of the plurality of windows to a GC correction, and to a correctionbased on an expected number of uniquely-mapped reads adjusted by acontrol set to obtain a corrected number of uniquely-mapped reads,calculating a significance value of the difference between two numericalpopulations each consisting of the corrected numbers of uniquely-mappedreads falling into windows on each of the two sides of a demarcationpoint, the demarcation point being a starting point or an ending pointof each of the plurality of windows, to thereby select the demarcationpoint having a smaller significance value as a candidate CNV breakpoint;calculating a significance value of the difference between two numericalpopulations each consisting of the corrected number of uniquely-mappedreads falling into windows contained within each of two sequences, withone sequence ranging from a given candidate CNV breakpoint to anadjacent upstream candidate CNV breakpoint, and the other sequenceranging from the given candidate CNV breakpoint to an adjacentdownstream candidate CNV breakpoint, and removing the candidate CNVbreakpoint having the least significance at every turn and recalculatingthe significance value for the two candidate CNV breakpoints adjacent tothe removed candidate CNV breakpoint, performing cyclic iteration untilthe significance values of all candidate CNV breakpoints are less than atermination threshold value, to thereby determine the CNV breakpoint. 2.The method of claim 1, further comprising a step of: subjecting the atleast one part of the nucleic acid molecule of the sample to sequencing,to obtain the reads.
 3. The method of claim 1, wherein each of theplurality of windows contains the same number of reference unique reads,or each of the plurality of windows has the same length.
 4. The methodof claim 1, wherein the termination threshold value is obtained based onthe control set consisting of normal samples.
 5. The method of claim 1,wherein the step of subjecting the number of uniquely-mapped readsfalling into each of the plurality of windows to the GC correction andto a correction based on the expected number of uniquely-mapped readsadjusted by the control set to obtain the corrected number ofuniquely-mapped reads further comprises one or more selected from thegroup consisting of: grouping the plurality of windows based on a GCcontent, and obtaining a correction coefficient based on a mean value ofthe number of uniquely-mapped reads within one group and a mean value ofthe number of uniquely-mapped reads for all of the plurality of windows,and then subjecting the number of uniquely-mapped reads falling intoeach of the plurality of windows to the correction, to thereby obtainthe GC-corrected number of uniquely-mapped reads; and obtaining theexpected number of uniquely-mapped reads adjusted by the control set byfollowing steps: calculating a ratio of the GC-corrected number ofuniquely-mapped reads falling into each of the plurality of windows inthe control set to the total number of uniquely-mapped reads; obtaininga mean value of the ratios for all windows corresponding to the controlset; and calculating the expect number of uniquely-mapped reads for eachof the plurality of windows in the sample based on obtained mean valueof the ratios and the total number of uniquely-mapped reads in thesample.
 6. The method of claim 5, wherein a cyclization with a singlechromosome or a whole genome is performed when selecting the candidateCNV breakpoint.
 7. The method of claim 1, wherein after the CNVbreakpoint is determined, the method further comprises: subjecting asequence between two CNV breakpoints to a confidence selection, whereinthe confidence selection comprises steps of: determining a confidenceinterval of the normal corrected number of uniquely-mapped reads usingthe control set, based on a distribution of the corrected number ofuniquely-mapped reads; and determining an abnormality presenting in thesequence between the two CNV breakpoints, if the mean value of thecorrected number of uniquely-mapped reads within the sequence is outsidethe confidence interval.
 8. The method of claim 7, wherein the correctednumber of uniquely-mapped reads fits a normal distribution, and theconfidence interval is 95%.
 9. The method of claim 1, wherein the methodfurther comprises one or more selected from the group consisting of:obtaining the sample from a human, the sample comprising amniotic fluidobtained by amniocentesis, villus obtained by chorionic villi sampling,umbilical cord blood obtained by percutaneous umbilical blood sampling,spontaneous miscarrying fetus tissue or human peripheral blood;obtaining a genomic DNA of the sample by a DNA extraction method such asa salting-out method, a column chromatography method, a beads method, ora SDS method; subjecting the genome DNA of the sample to randomfragmenting by enzyme digestion, pulverization, ultrasound, orHydroShear method, to obtain DNA fragments; and subjecting the DNAfragments to single-end sequencing or pair-end sequencing, to obtain thereads of the DNA fragments.
 10. The method of claim 1, wherein themethod further comprises: adding different indexes to each of the DNAfragments of the samples, to distinguish different samples.
 11. A systemfor detecting a copy number variation comprising: a reads obtainingunit, for obtaining reads from at least one part of a nucleic acidmolecule of a sample; a uniquely-mapped reads determining unit, fordetermining uniquely-mapped reads aligned to a genome reference sequencebased on obtained reads; a number of uniquely-mapped reads calculatingunit, for dividing the genome reference sequence into a plurality ofwindows, and calculating the number of uniquely-mapped reads fallinginto each of the plurality of windows; a number of uniquely-mapped readscorrecting unit, for subjecting the number of uniquely-mapped readsfalling into each of the plurality of windows to a GC correction, and toa correction based on the expected number of uniquely-mapped readsadjusted by a control set to obtain the corrected number ofuniquely-mapped reads; a candidate breakpoint selecting unit, forcalculating a significance value of the difference between two numericalpopulations each consisting of the corrected numbers of uniquely-mappedreads falling into windows on each of the two sides of a demarcationpoint, the demarcation point being a starting point or an ending pointof each of the plurality of windows, to thereby select the demarcationpoint having a smaller significance value as a candidate CNV breakpoint;a breakpoint determining unit, for calculating a significance value ofthe difference between two numerical populations each consisting of thecorrected number of uniquely-mapped reads falling into windows containedwithin each of two sequences, with one sequence ranging from a givencandidate CNV breakpoint to an adjacent upstream candidate CNVbreakpoint, and the other sequence ranging from the given candidate CNVbreakpoint to an adjacent downstream candidate CNV breakpoint, andremoving the candidate CNV breakpoint having the least significance atevery turn and recalculating the significance value for the twocandidate CNV breakpoints adjacent to the removed candidate CNVbreakpoint, performing cyclic iteration until the significance values ofall candidate CNV breakpoints are less than a termination thresholdvalue, to thereby determine the CNV breakpoint.
 12. The system of claim11, wherein each of the plurality of windows contains the same number ofreference unique reads, or each of the plurality of windows has the samelength.
 13. The system of claim 11, wherein the termination thresholdvalue is obtained based on the control set consisting of normal samples.14. The system of claim 11, wherein the number of uniquely-mapped readscorrecting unit comprises: a GC correction module, for grouping theplurality of windows based on a GC content, and obtaining a correctioncoefficient based on a mean value of the number of uniquely-mapped readswithin one group and a mean value of the number of uniquely-mapped readsfor all of the plurality of windows, and then subjecting the number ofuniquely-mapped reads falling into each of the plurality of windows tothe correction, to obtain the GC-corrected number of uniquely-mappedreads; a window correction module, for calculating a ratio of theGC-corrected number of uniquely-mapped reads falling into each of theplurality of windows in the control set to the total number ofuniquely-mapped reads; obtaining a mean value of the ratios for allwindows corresponding to the control set; and calculating the expectnumber of uniquely-mapped reads for each of the plurality of windows inthe sample based on the obtained mean value of the ratios and the totalnumber of uniquely-mapped reads in the sample.
 15. The system of claim11, wherein after the CNV breakpoint is determined by the breakpointdetermining unit, the system further comprises: a breakpoint filteringunit, for determining a confidence interval of the normal correctednumber of uniquely-mapped reads using the control set, based on adistribution of the corrected number of uniquely-mapped reads; anddetermining an abnormality presenting in the sequence between the twoCNV breakpoints, if the mean value of the corrected number ofuniquely-mapped reads within the sequence is outside the confidenceinterval.
 16. The system of claim 15, wherein the corrected number ofuniquely-mapped reads fits a normal distribution, and the confidenceinterval is 95%.
 17. The system of claim 11, wherein the system furthercomprises one or more selected from the group consisting of: means forobtaining the sample from a human, with the sample comprising amnioticfluid obtained by amniocentesis, villus obtained by chorionic villisampling, umbilical cord blood obtained by percutaneous umbilical bloodsampling, spontaneous miscarrying fetus tissue or human peripheralblood; means for obtaining a genomic DNA of the sample by a DNAextraction method such as a salting-out method, a column chromatographymethod, a beads method, or a SDS method; means for subjecting the genomeDNA of the sample to random fragmenting by enzyme digestion,pulverization, ultrasound, or HydroShear method, to obtain DNAfragments; and means for subjecting the DNA fragments to single-endsequencing or pair-end sequencing, to obtain the reads of the DNAfragments.
 18. The system of claim 11, wherein different samples aredistinguished by adding different indexes to each of the DNA fragmentsof the samples.
 19. The system of claim 11, wherein in the candidatebreakpoint selecting unit, a cyclization with a single chromosome or awhole genome is performed when selecting the candidate CNV breakpoint.20. The method of claim 1, wherein the sample is obtained from a subjectin need of determining the copy number variation, and one or more of thesteps of the method are computer-implemented.